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^ ■ Abstract 

Revealing the clonal composition of a single tumor is essential for identifying cell subpopulations with 
metastatic potential in primary tumors or with resistance to therapies in metastatic tumors. Sequencing 
technologies provide an overview of an aggregate of numerous cells, rather than subclonal-specific quan- 
tification of aberrations such as single nucleotide variants (SNVs). Computational approaches to de-mix 
a single collective signal from the mixed cell population of a tumor sample into its individual components 
are currently not available. 

Herein we propose a framework for deconvolving data from a single genome- wide experiment to infer 
the composition, abundance and evolutionary paths of the underlying cell subpopulations of a tumor. 
The method is based on the plausible biological assumption that tumor progression is an evolutionary 
,-0 ' process where each individual aberration event stems from a unique subclone and is present in all its 

descendants subclones. We have developed an efficient algorithm (TrAp) for solving this mixture problem. 
In silico analyses show that TrAp correctly deconvolves mixed subpopulations when the number of 
subpopulations and the measurement errors are moderate. We demonstrate the applicability of the 
method using tumor karyotypes and somatic hypermutation datasets. We applied TrAp to SNV frequency 
profile from Exome-Seq experiment of a renal cell carcinoma tumor sample and compared the mutational 
\Q ' profile of the inferred subpopulations to the mutational profiles of twenty single cells of the same tumor. 

Despite the large experimental noise, specific co-occurring mutations found in clones inferred by TrAp 
are also present in some of these single cells. Finally, we deconvolve Exome-Seq data from three distinct 
metastases from different body compartments of one melanoma patient and exhibit the evolutionary 
' relationships of their subpopulations. 



Introduction 



The mechanisms of cancer evolution and metastatic onset are still largely unknown. The diversity, 
complexity and evasive nature of tumor biology are central reasons for the seemingly slow progress in 
the cure of most cancer types, particularly in controlling the ability of tumor populations to spread. 
Tumor populations are dynamic aggregates of constantly evolving subclones, each carrying a variety 
of genomic aberrations [1-16]. This genetic heterogeneity is often associated with differences in the 
biological behavior of different cell subpopulations. Some of these subclones are likely to be the primary 
instigators of invasion, metastases or relapse following treatment [11,17-20]. An effective characterization 
of the aggressive potential of tumors at early stages has an enormous potential to guide new clinical 
interventions and translational research [21]. 

Recently, several efforts have been made to provide a complete genealogical perspective of cancer 
evolution [22-26]. Using fluorescent labeling techniques, or more recently, single cell sequencing, it is 
technically possible to separate single cells from tumor samples in order to investigate their evolutionary 
patterns [22-31]. However, these approaches are limited to either a small number of fluorescent markers 
[23,32], or to a relatively small number of single cells. On one hand, the identification and selection 
of uncharacterized subclones in high-throughput experiments is beyond the capabilities of current cell- 
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sorting technologies; on the other hand, isolation and profiling of enough single cells in order to obtain 
a statistically representative sample of a tumor composed of millions of cells has, currently, prohibitive 
costs. For this reason, genomics profiling of tumors still relies on pooling in order to provide global 
averaged signals over the subclonal population within a tumor sample [33-36] . Computational methods for 
identifying subclones, quantifying their relative abundance and monitoring their emergence and dynamics 
could prove extremely useful for the analysis of the heterogeneity of these pooled samples. This problem 
has been often overlooked due to its mathematical complexity. 

We herein present a mathematical approach to de-mix signals from heterogeneous cell populations into 
their subclonal components and subsequently unveil the underlying dynamic tumor heterogeneity. Our 
proposed method is related to the problem of blind source separation [37-44], where both the underlying 
sources and their relative composition are unknown. In contrast to blind source separation methods we 
cannot assume that the underlying sources are statistically independent, we have no prior knowledge of 
the number of sources and we have at our disposal only one mixture of the unknown sources. This math- 
ematical problem has a vast number of solutions and can be addressed only if additional constraints are 
imposed. Hence, we assume that tumor cell populations develop in a parsimonious evolutionary process. 
Furthermore, based on empirical observations, we introduce a sparsity constraint that limits the number 
of subpopulations. Distinctively from the standard problem of phylogeny [45-56], where each species is 
observed and measured separately, our methodology, which we term Tree Approach to Clonality (TrAp), 
is specifically designed to deconvolve a single aggregate signal into its different subclonal components. 
TrAp incorporates biologically motivated constraints which allow us to infer: 1) the composition of the 
subclones in a single aggregate sample 2) the abundance of each subclone and 3) the evolutionary path 
of the subclones. 

The paper is organized as follows: we first define the subclonal deconvolution problem and 

present an efficient algorithm for finding all its solutions. Using in silico simulated data we verify that 
the algorithm is able to correctly deconvolve mixed subclonal populations and that the method is robust 
to realistic measurement errors. Further, the solution is often unique when the number of populated 
subclones is moderate. In addition, we also show that TrAp can correctly deconvolve random mixtures 
of karyotypes of several cells from the same tumor biopsy or from mixture of sequences generated in a 
study involving somatic hypermutations in B cells. We subsequently compare the mutation profiles of 
twenty Exome-Seq single cell experiments to those inferred using an aggregate signal generated by exome 
sequencing from the same renal cell carcinoma tumor. Finally, we apply TrAp to Exome-Seq data from 
three metastases from three distinct body compartments and compare their subclonal compositions and 
evolutionary histories. 

Results 

The results are divided in four parts. In the first part we describe our novel TrAp algorithm for subclonal 
deconvolution of aggregate genomic signals consisting of aberrations' frequencies and we show that the 
TrAp algorithm always identifies at least one solution; we also introduce an extension to our work that 
is suited for signals where each locus can be affected by distinct consecutive aberrations (e.g. several 
consecutive point mutations affecting the same nucleotide). In the second part, we simulate noisy aggre- 
gate signals constructed by random linear combinations of simulated subclonal aberration profiles. We 
used these simulated data to show that TrAp can correctly deconvolve mixtures of evolutionarily related 
subclones even in presence of levels of noise that are typically found in current genomics experiments. 
In the third part, we generated realistic aggregate signals by mixing subclonal genomic profiles obtained 
from cytogenetic analyses using random coefficients. We generate these data separately for each patient 
and show that, for nearly all aggregate samples, TrAp recovered the subclonal components. Similarly, 
we generated realistic aggregate signals from somatic hypermutated (SHM) regions from B cells. As we 
show, somatic hypermutation is a particularly suitable system for the framework of our TrAp algorithm, 
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which successfully recovered all components from the aggregate signals. In the fourth part, we apply 
our approach to exome-sequencing experiments. We present an analysis of recent single-cell exome se- 
quencing from a renal cell carcinoma study where, besides a collection of twenty single cells, an aggregate 
has also been measured. Despite the reported difference between the aggregate and mean aberration 
profile of the single cell experiments, TrAp could still identify subclones with co-occurring aberrations 
consistent with co-occurring aberrations found in direct single cell measurements. Finally, we analyze 
three metastases from separate body compartments of a melanoma patient and compare their inferred 
evolutionary patterns in a genomic region surrounding the DCC gene. 

The TrAp algorithm 

The subclonal deconvolution problem 

We consider a population of cells where each cell can be described by a binary vector, which we call 
genotype. Each element of the genotype vector has an aberration state modeled as a binary variable, 
e.g. the presence/ absence of a mutated nucleotide in a specific genomic position or the presence/absence 
of a specific copy number variation event in a specific locus. For each cell, the i-th element of the genotype 
vector is 1 if the i-th aberration is present in the cell and if the aberration is absent. A subclone is 
a collection of all cells that have identical genome-wide aberration profile. A subclone is populated in 
the sample if the fraction of cells sharing the subclone's genome is greater than zero and can be detected 
by the experiment. 

We define the subclonal deconvolution problem as the task of de-mixing an aggregate measure- 
ment into a linear combination of (unknown) subclonal genotypes. The only information that is required 
as input is the aggregate frequency vector y, whose elements yi correspond to the frequency of the 
i-th aberration in the sample cell population. For efficiency, we remove from the genome all genotype 
entries k that are homogenous within the population (i.e. y^ = or yu = 1), as they do not need to 
be deconvolved. Next, to ensure that the aberration-free non cancerous cells (wildtype) are included 
in the solution of the problem, we add one dummy aberration to all the normal and cancerous cells in 
the sample. By construction, the aggregate frequency of this dummy aberration y\ is equal to 1. Fi- 
nally, without loss of generality, we sort the aggregate frequency vector y in descending order such that 
1 = t/i > j/2 > ■ • • > Un > 0, where N is equal to the number of aberration events considered including 
the dummy aberration y\ . The subclonal deconvolution problem can be written in matrix notation as 

y = Cx, (l) 

where C is a matrix of size N x M whose elements Cjj are 1 if aberration i is present in subclone Cj and 
otherwise; N is the size of the vector y; M is the number of subclones; and x is a vector of size M 
where each element xj corresponds to the frequency of subclone Cj in the sample. We note that, without 
introducing the wildtype aberration, the wildtype subclone would correspond to a vector of zeros and we 
would not be able to infer the frequency of the wildtype component using the linear model of Equation (1). 
Furthermore, since the dummy aberration is present in the wildtype and all other subclones, it follows 
that (i)Vj, Cij = 1 and (ii) Xj = 1. We note that since M, C and x are all unknown in this problem 
there is an intractable number of possible solutions. As previously discussed [57], for M > 2 the system 
is underdetermined and the aggregate signal cannot be uniquely deconvolved by solving the linear system 
and it is not even possible to find parsimonious unique solutions using sparse reconstruction methods. 
However, by introducing additional biologically motivated constraints to the model, we can dramatically 
reduce the number of possible solutions, such that the problem may have a tractable number of optimal 
solutions, ideally only one. We therefore seek the family of solutions (TrAp-solutions) that sequentially 
satisfy the following constraints: 

1. Evolutionarity. The subclones are generated in an evolutionary process starting from a subclone 
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with no aberrations. Every subclone is generated from an existing subclone by adding to it a single 
new aberration event. 

2. Parsimony. The number of subclones M that are generated during the evolution process is 
minimal. 

3. Sparsity. The number of populated subclones P (i.e. the number of subclones j for which xj > 0) 
is minimal. 

4. Shallowness. The depth of the evolutionary tree (i.e. the number of generations) D = maxj Q2i c ij) 
is minimal. 

A schema of a TrAp solution is shown in Figure 1. 
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Figure 1. A schema of deconvolution of the mixed signal of four subclones. In this example, 
the aggregate signal frequency vector y on the left hand side of the matrix-vector equation represents 
the frequency of five aberrations in the aggregate sample. To allow the heterogeneous mixture of 
subclones to include normal cells we introduce a dummy aberration that is present in any cell. The 
frequency of the dummy aberration y\ is equal to one. The frequencies of the actual five aberrations 
A2, A3, A4, A5, and Aq encoded in the remaining elements of the vector Y are given by 7/2 = 0.6, 
2/3 = 0.4, 2/4 = 0.35, 2/5 = 0.3 and ye — 0.1, respectively. In this example, the optimal TrAp solution is 
unique and has four populated subclones: C2 with aberrations {A2}, C4 with aberrations {A2, A4}, Cg 
with aberrations {A 3 , A 5 } and Cg with aberrations {A 3 , A 6 }. The optimal solution is shown both as an 
evolutionary tree (top) and in matrix form according to Equation (1) (bottom), where the tree topology 
is encoded in the binary matrix and the relative composition of the subclones is represented in the 
column vector. 
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The evolutionarity constraint is used in many biological systems, in particular when studying devel- 
opment of cell populations [45-55]. The parsimony constraint is typically satisfied because the expected 
probability of a specific aberration event in a nucleotide or a locus is low and it is therefore unlikely 
that an event occurs more than once and independently in distant subclones. This constraint is the 
main criterion used to determine the optimality of maximum parsimony methods commonly used in 
phylogeny [48,56,58,59]. The parsimony constraint dramatically reduces the number of possible solu- 
tions of Equation (1) because it limits the number of subclones M. The sparsity constraint is justified 
by the fact that some subclones may die out or may be too rare to be detected. Also, it has been 
shown in several studies that few subclones acquire an evolutionary advantage and outgrow the other 
subclones [5,13,60,61], thus reducing the number of populated subclones. The shallowness constraint is 
justified as excessive genomic instability may not be viable, thus evolutionary trees tend to be shallow 
and wide rather than deep and narrow. 

Properties of TrAp solutions 

In this subsection we show that in the subclonal deconvolution problem the evolutionarity and parsimony 
constraint can always be satisfied by a naive model in which each aberration event occurs exactly once 
during evolution (i.e. M = TV). We call any solution with M = TV an TV-solution and its evolutionary 
tree an TV-tree. 

The TV-solution optimally satisfies the evolutionary and parsimony constraints. Since all detected 
aberrations need to occur at least once in the evolutionary tree, the number of subclones must be greater 
than or equal to the the number of aberration events considered, i.e. M > TV. We note that it is always 
possible to construct a valid solution for Equation (1) of exactly M = TV subclones for every aggregate 
frequency vector y. Specifically, for a cascade-like evolutionary process with no branching, where the 
wildtype subclone C\ is the root of the tree and every other subclone C\ is a direct descendant of the 
subclone Cj_i, the solution of Equation (1) is given by: 
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While this cascade-like solution satisfies both evolutionarity and parsimony constraints, this solution is 
not necessarily optimal with respect to the sparsity constraint and is the least optimal in terms of the 
shallowness constraint. Furthermore, the existence of this solution guarantees that there is always at 
least one solution to the subclonal deconvolution problem. 

Since we imposed that the four constraints to the subclonal deconvolution problem must be satisfied 
sequentially, any solution with M > TV subclones will always be less optimal than the solution given in 
Equation (2), regardless of its sparseness and shallowness. We can thus limit the search space of the 
TrAp algorithm to TV-solutions. For this subset of solutions, the vector x is of size TV and C is a square 
matrix of size TV x TV. Importantly, the index of each subclone Cj indicates the subclone for which the i-th 
aberration occurs for the first time. Hence, for any TrAp solution there is a one-to-one correspondence 
between the i-th aberration and the subclone Cj whose index indicates that it evolved from its parent 
subclone by acquiring the i-th aberration. Moreover, each aberration event of an TV-tree occurs exactly 
once and cannot be reverted during evolution. Each aberration i is thus present only in subclone Ci and 
its subclonal descendants: 

N N 
Vi ^ ^ ^ ] ^ij'Eji (^) 



G 



where is the ancestor indicator variable that is equal to 1 if subclone Cj is an ancestor of subclone 
Cj and otherwise. We note that Vj > 1, ot\j — 1, which means that the wildtype clone C\ is always the 
root of the evolutionary tree, as required by the evolutionarity constraint. 

In Equation (3), we expressed the aggregate frequency yi as the sum of the frequencies of all subclones 
descending from Cj. We now express the aggregate frequency yi as a function of the direct descendants 
of Cj. We define the parent indicator variable fa, which is 1 if Cj is the parent of Cj (i.e. if subclone 
Cj is the result of a single aberration event in subclone Cj) and otherwise. Finally, using the parent 
indicator variables we express in terms of the aggregate frequencies of the direct descendants of Ci 

N N / N \ N 

Vt=Xi+^2 a ij X J = X i+^2 hi I X 3 + ^2 a jkXk J = Xi + ^ fooVh ( 4 ) 
3 = 1 j=l \ k=l ) 3 = 1 

Equation (4) can be rearranged to express the vector x in terms of y and the parent indicator matrix 

x = y *y, (5) 

where 4? is the N x N matrix whose elements are given by the parent indicator variables tfiij. Because of 
the evolutionary constraint, the matrix <I> has only N — 1 nonzero elements, reflecting the fact that each 
subclone except the wildtype has exactly one parent subclone, i.e. X^jLi 4>n = an d Vj > 1, X^jLi 0»j = 
1. Furthermore, an important corollary of Equation (4) is that the subclone Ci is not populated if and 
only if 

N 

Vi-^faVj^Q- (6) 

j=l 

In other words, the clone Ci is not populated when the aggregate frequency yi of aberration i is equal 
to the sum of the aggregate frequencies of all the direct descendants of the subclone Cj. Therefore, the 
number of non-populated subclones of the TV-tree encoded by 3? is given by the number of aberrations i 
that satisfy Equation (6). 

Finally, we summarize the relationships between C and the indicator variables a and <ft. Using 
Equation (3), we can express the subclonal deconvolution problem as y = Cx = (I + A)x, where I is 
the N x N identity matrix and A is the N x N matrix of whose elements are given by the ancestor 
indicator variable a^ . Furthermore, Equation (5) allows us to write the subclonal deconvolution problem 
as x = (I — «&)y. We can therefore express the relationships between the matrices C, A and $ as: 

C = (I + A) = (I-*)- 1 . (7) 

A brute-force algorithm for solving the subclonal deconvolution problem 

We now observe that the relationship between two subclones Ci and Cj such that i < j (which also 
implies yi > yj as the vector y is sorted), must be one of the following: i) d is an ancestor of Cj, i.e. 
a>ij = 1, (Xji = and yi > yj; or ii) d and Cj are on separate branches, i.e. ay = 0, a*ji = and 
Hi + Vj 1- This property implies that all evolutionary trees can be generated iteratively by starting 
with the wildtype clone C\ and adding the clone Ci at step i to all trees generated at step i — In detail, 
for any tree that can be generated using subclones C\, . . . , d-i we generate a new tree by adding the 
subclone Ci as direct descendant of subclone Cj for all j < i for which the resulting Xj (calculated using 
Equation (5)) remains nonnegative after adding d. 

For completeness, when yi — yj the subclones Ci and Cj can be either on separate branches or on the 
same branch. If they are on the same branch, then there is an ambiguity regarding the order in which the 
two aberrations occur (Figure 2). However, in the case that these two subclones are on the same branch, 
the aberration profile of the ancestor subclone (shown in green in Figure 2) is not informative because this 
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subclone is not populated (x a = 0) and aberrations i and j are both present in the descendant subclone 
regardless of the order in which they occur. Since these two aberrations cannot be observed separately 
(i.e. the coefficient x a associated with the ancestor aberration is zero, whereas the Xd associated with 



both aberrations could be nonzero), we only output the solution for which C n 



is an ancestor of 



Cmax{ij} (left solution in Figure 2). This choice ensures that for every pair of aberrations i < j, Cj 
cannot be an ancestor of Ci . A step-by-step example of the TrAp solution obtained using the brute-force 
algorithm is given in Figure 3. 
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Figure 2. Deconvolution of a mixture where two aggregate signal frequencies are equal. In 

this example, five aberrations (A2, A3, A4, A$ and Aq) were measured from an aggregate sample and 
their frequencies were 2/2 = 0.8, 7/3 = 0.5, j/4 = 0.5, j/5 = 0.4 and ye = 0.2, respectively. The dummy 
measurement y\ — 1 was also added to generate the aggregate signal frequency vector 
y = [1, 0.8, 0.5, 0.5, 0.4, 0.2]. In this example, there are two optimal TrAp solutions (left and right), each 
shown both as an evolutionary tree (top) and in matrix form according to Equation (1) (bottom). Both 
solutions have 4 common populated subclones, namely C2 with aberration {A2}, C5 with aberrations 
{A2, A3, A4, A5}, Cq with aberration {Aq} and a subclone with aberrations {A2, A3, A4}. In both cases, 
the ancestors of the clone with aberrations {A2, A3, A4} (C3 of the left tree and C4 of the right tree) are 
not populated. We remark that these two solutions are practically indistinguishable and that the TrAp 
algorithm outputs only the solution where the subclonal indices along each branch are arranged in an 
increasing order (as shown in the left tree solution). 
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Figure 3. Brute-force search approach to deconvolve a mixture of four subclones. In this 
example, five aberrations (A 2 , A3, A 4 , A 5 and A 6 ) were measured from an aggregate sample and their 
frequencies were 2/2 = 0.8, 2/3 = 0.5, 2/4 = 0.5, 2/5 = 0.4 and ye — 0.2, respectively. The dummy 
measurement y\ = 1 was also added to generate the aggregate signal frequency vector 
y = [1, 0.8, 0.5, 0.5, 0.4, 0.2]. In the first step, the wild type clone C\ (representing the wildtype 
subpopulation) is positioned at the root of the tree. In the second step, a tree reconstruction begins and 
C2 is added to the only possible ancestor clone C\. In the third step, C3 must be added as direct 
descendant of C%. C3 cannot be added to the tree on a different branch as a direct descendant of C\ 
because based on Equation (5) this would imply a negative frequency X\ = t/i — 2/2 — Vz — —0.3. 
Likewise, the subclones C4 and C5 can only be added as direct descendants of C3 and C4, respectively. 
Finally, Cq can be added as a direct descendant of subclones Ci, C2 or C5 generating solutions S\, S2 
or S3, respectively. However, S\ is the only TrAp-solution as its number of populated subclones is 
minimal and corresponds the solution shown in the left side of Figure 2. Solution 53 is the cascade-like 
solution described in Equation (2). 



The upper bound on the number of possible evolutionary TV-trees is thus (N — 1)!, as every subclone 
i can only be the direct descendant of i — 1 subclones. This bound is significantly smaller than the 
number of all possible trees with N labeled vertices, which is N N ~ 2 according to Cailey's formula [62,63]. 
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Furthermore, we note that the parent indicator matrix <I> and C are upper triangular and that both C 
and I — $ are of rank TV and invertible, which guarantees that Equation (7) can always be used to switch 
between the representation with the parent indicator variable (Equation (5)) and the representation 
with the subclone matrix C (Equation (1)). We also note that, given a matrix <fr (or C = (I — the 
vector x is uniquely determined. In particular, if the C matrix is known, the vector x can be efficiently 
found by solving the linear system Cx = y using back-substitution (i.e. by solving Equation (3) first for 
xjv, then using xn to solve for xn-i and repeating through x\). 

TVAp: a fast algorithm for solving the subclonal deconvolution problem 

As we have shown, the number of non-populated subclones of a given iV-tree is equal to the number of 
aberrations i that satisfy Equation (6). Moreover, in order to satisfy the sparsity constraint of a solution, 
we do not need to know the topology of the whole evolutionary tree, but only the subset of rows of 
the matrix <& that satisfy Equation (6). We now leverage on this property to efficiently generate sparse 
solutions to the subclonal deconvolution problem. 

First, we group each subset of subclones that satisfy Equation (6) into a first generation tree 

Tj, which we define as the subset of subclones |Cp* , C^*, . . . , Cjf. | such that the subclone CJ* is not 

populated (i.e. Xp* = 0) and that iVj subclones Cf* , . . . , C]j. are its direct descendants. Each first 
generation tree is represented by a row of the $ matrix. For example, there are three first generation trees 
for the aggregate signal Y = {1,0.6,0.4,0.35,0.3,0.1}, namely 7\ = {C 1 ,C 2 ,C 3 }, T 2 = {C 1 ,C 2 ,C 5 ,C 6 } 
and T 3 = {C 3 , C5, C%] (Figure 4). We note that the optimal TrAp solution for this example contains the 
first generation trees T\ and T 3 (Figure 1). Furthermore, a $ matrix associated with a first generation 
tree must follow the evolutionary constraints previously described (V7 > 1, 4>ij = 1) an( i thus the 

first generation tree also gives complete information about the columns of 3? corresponding to the direct 
descendant subclones C\ l , . . . , Cjf. . For example, the first generation tree T\ — {Ci, C 2 , C3} implies that 
4>i.2 = and (/>j ; 3 = for every i ^ 1 (Figure 4). 

Next, we define a partial tree as a collection of first generation trees {Ti, . . . , T^} that can jointly 
be contained in a full evolutionary tree. Since each first generation tree can be represented by a row of 
the <fr matrix, a partial tree that is comprised of h first generation trees can be represented by h rows of 
the $ matrix. In the example above, the partial tree that contains the first generation trees T\ and T 3 is 
represented by rows 1 and 3 of the matrix 3? in the bottom panel of Figure 4. Similarly to first generation 
trees, the matrix <I? associated with a partial tree must follow the evolutionary constraint, which implies 
that not all combinations of first generation trees give rise to partial trees. In the example above, the 
partial trees 7\ and T 3 can be combined to generate the partial tree PT\ = {Xi,T 3 } (Figure 4 bottom), 
whereas the pairs {Ti,T 2 } and {T 2 ,T 3 } cannot be combined to generate partial trees. Therefore, in the 
example above the possible partial trees are PT\ = {Ti,T 3 }, the empty partial tree PT 2 = {} and the 
partial trees PT 3 = {Ti}, PT 4 = {T 2 } and PT 5 = {T 3 }. Moreover, we note that all TV-trees that contain 
a partial tree comprising of h first generation trees have N — h populated subclones. This implies that 
TrAp solutions, which must satisfy the sparsity constraint, must also contain one of the partial trees 
comprising the maximum number of first generation trees. In the example above, the optimal TrAp 
solution (Figure 1) contains the partial tree PTi, which is the only partial tree comprising two first 
generation trees. 

Since all TrAp solutions contain the maximum number of first generation trees, the TrAp algorithm 
dramatically reduces the search space by identifying the optimal partial trees and later utilizing them as 
starting points for rapidly building all the sparsest solutions to the subclonal deconvolution problem. In 
details, the TrAp algorithm solves the subclonal deconvolution problem in the following steps (Figure 5): 

1 . Identify all first generation trees from the aggregate signal vector y. 

2. Combine all first generation trees to generate all partial trees. 
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Figure 4. Identification of first generation trees. In this example, the aggregate signal frequency 
vector y = [1,0.6,0.4,0.35,0.3,0.1] is consistent with three first generation trees Ti = {Ci,C2,C3}, 
T 2 = {Ci, C2, C5,Cq} and T 3 = {C3, C5, Cq}. Each first generation tree is visualized as a matrix 
equation X = Y — $F according to Equation (5) (left) and as a partial evolutionary tree (right). In the 
bottom row, the partial tree PT\ given by the union of the partial trees Ti and T3 is shown. Question 
marks indicate values that are unknown as they are not specified by the first generation tree or by the 
partial tree. 



3. Discard partial trees that do not have the minimum number of populated subclones. 

4. Generate all evolutionary trees consistent with the partial trees comprising the maximum number 
of first generation trees. This step is done iteratively for each partial tree, similarly to the way 
described for the brute-force algorithm. The only difference is that, when the parent clone Cj* of 

a first generation tree Ti = < C£ i , C 1 i , . . . , C^. > is added to the tree, the subclones C 1 i , . . . , C^. 

are automatically added as direct descendants of Cj* . 

5. Optimize the shallowness constraint by sorting the generated solutions by the depth of the generated 
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tree. 



Step 1 : 

First generation trees 



C, (0%) 



C 3 (0%) 











C 2 (80%) 




C 6 (20%) 

A 6 







PT 2 ={T 1 } 



PT 3 ={T 2 } 



PT 4 ={T 1 ,T 2 } 



Step 2: 
Partial trees 



C, (0%) 



C 3 (0%) 



(50%) 




Step 3: 

Tree generation 
from PT A 



Put C : (and TJ 



Add C 3 (and T 2 ) 



Add C, 



C, (0%) 



C 6 (20%) 



C, (0%) 

C 3 (0%) 

A 2 , Ag 



C, (0%) 

>/\ 

I C 6 (20%) C 2 (30%) C 6 (20%) 

\ | A ? | | A * | 

C 3 (0%) 

A 2 , A 3 

C 4 (10%) 

A 2 , A 3 , A 4 



Figure 5. Illustration of the usage of first generation trees and partial trees for deriving 
the TrAp solution of a mixture of four subclones. In this example, five aberrations were 
measured from an aggregate sample and their frequencies were y 2 = 0.8, 2/3 = 0.5, 2/4 = 0.5, j/5 = 0.4 
and uq — 0.2, respectively. The dummy measurement y± = 1 was also added to generate the aggregate 
signal frequency vector y = [1,0.8,0.5,0.5,0.4,0.2]. In the first step, TrAp identifies all first generation 
trees, namely T\ = {Ci, C2, Cq} and T2 — {C3, C4}. In the second step, TrAp generates the possible 
partial trees, namely PTi = {}, PT 2 = PT 3 = {T 2 } and PT 4 = {T l7 T 2 }, and consequently selects 

only PT4 = {Ti,T 2 } as it is the only partial tree which contains a maximum number of first generation 
trees. In the third step, TrAp generates evolutionary trees starting from the partial tree 
PT4 — {Ti,T 2 }. To complete the evolutionary tree starting from PT4, the subclone C\ is positioned as 
the root of the tree. Since C\ is part of the first generation tree T\, the subclones C 2 and Cg are 
automatically added as a direct descendant of C\. Next, C3 is added as a direct descendant of C 2 . Since 
C3 is part of the first generation tree T 2 , C4 is automatically added as direct descendant of C3 . Finally, 
C5 is added as a direct descendant of C4, generating the optimal TrAp solution to the subclonal 
deconvolution problem. We remark that the optimal solution generated by the TrAp algorithm is equal 
to 51 in Figure 3 and to the left solution of Figure 2. 
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The performance of the TrAp algorithm is equivalent to the brute-force approach when there are 
no first generation trees (i.e. when all subclones are populated) and becomes superior to a brute-force 
approach when P < N. Furthermore, the algorithm can be modified by imposing additional constraints 
that need to be satisfied at each step of the iterative tree reconstruction procedure (an example of such 
modification is presented in the subsection on extension of TrAp to non-binary aberrations). The TrAp 
algorithm by default returns only the solution(s) that optimize all the constraints. In addition, the user 
can specify parameters to relax the sparsity and shallowness constraints. 



Generalization of TrAp to non-binary aberrations 

In the previous sections we represented the genome of each subclone by a vector of binary values whose 
entries represent if a genomic position is in a normal state (0) or in an aberrated state (1). In general 
the number of states in a given genomic position could be larger than two and hence subclones cannot 
be represented by vectors of binary values without loss of information. For example, a nucleotide found 
in the reference genome or in the germline at a specific position may undergo multiple distinct point 
mutation events into more than one specific nucleotide. In this subsection, we describe an extension of 
the TrAp algorithm to deal with such cases. 

We consider the generalized subclonal deconvolution problem in which the genome consists of 
N positions each of which can have S different states. We also assume that the genome of the wildtype 
subclone is known. The only information that we utilize as input is the aggregate frequency matrix 
Z whose elements z^^ correspond to the observed frequency of the aberrated state s at position k. We 
note that, by construction, < Zk, s < 1 and that X) s =i z ks = 1- To utilize our framework for solving 
the subclonal deconvolution problem, we convert the information encoded in Z as a vector y whose 
elements represent frequencies of binary events. We perform this transformation in several steps. First, 
we vectorize the matrix Z by concatenating its rows to construct the vector z, which has KS elements. 
Then, we remove from the vector z the entries for which Zks = as they are not informative. As a result, 
for each position k there are Sk entries, where Sk ranges from 1 (only the unmutated state is observed) 
to S (all aberrated states are observed). For illustration of these first steps, we consider a toy example 
of a genome of length three whose wildtype sequence is "CAT" and we analyze an aggregate sample 
made of three subclones with sequences "TCT", "TAC" and "CGT" mixed with frequencies 0.1, 0.3 and 
0.6, respectively (Figure 6). In this example the z vector consists of the elements z\c = 0.6, z\t = 0.4, 
z 2 a = 0.3, z 2 c = 0.1, z 2 g = 0.6, z 3C = 0.3 and z 3T = 0.7 (Figure 6). 

Next, we wish to design a binarization matrix B to encode the information contained in z as a vector 
y = Bz whose elements represent the frequency of binary aberrations and can thus be used as input 
to the subclonal deconvolution problem for the whole genome (Equation (1)). For every position fc, we 
assume that each state s (1 < s < Sk) is reached by a sequence of aberration events. We denote by Ak. s 
an aberration event to state s at position k. In the example above, there are two states at position 1. 
The unmutated state C is reached by a dummy aberration A\c (in analogy to the dummy aberration of 
the wildtype clone in the subclonal deconvolution problem) and the aberrated state T is reached by the 
sequence of the dummy aberration A\c followed by the aberration Ayr- Since the dummy aberration 
A\c is present when we observe states C or T at position 1, the frequency of the unmutated aberration 
is V\c = zic + z it = 1- However, the aberration Ait is present only when we observe the state T, 
therefore the frequency of the aberration Ait is yvr = zit = 0.4 (Figure 6). Since measurements at 
different genomic positions do not affect one another, we construct B as a block-diagonal matrix: 
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Figure 6. Application of the TrAp algorithm to deconvolve a mixture of three sequences 
in presence of poly-allelic mutations. We analyzed an aggregate sample composed of three 
subclones with sequences "TCT", "TAC" and "CGT" mixed with frequencies 0.1, 0.3 and 0.6, 
respectively. In this particular example, the wildtype sequence "CAT" is absent in the mixture. 
Nonzero elements of the aggregate frequency matrix Z (top right) are then concatenated in the z 
vector, which consists of the elements z\c = 0.6, zyr = 0-4, z 2 a = 0.3, z 2 c — 0.1, z 2 q — 0.6, 23c = 0.3 
and z%t = 0.7 (middle left). In the center of the middle panel we show the binarization matrix B^ 1 ' 1 ' 1 ) 
and the matrix-vector product B^ 1,1,1 ^ associated with it, which are consistent with Equation (8) and 
lead to the optimal TrAp solution. Next, rows and columns corresponding to unmutated states (i.e. 1C, 
2A and 3T, shown in green) are substituted with a dummy aberrated state corresponding to the 
wildtype (shown in blue). In the bottom row, the TrAp solution is shown both as an evolutionary tree 
(left) and in matrix form according to Equation (9) (right). 



where y k = [Vki, ■ ■ • , VkS k ] and zk = [z kl , . . . , z kSk ]. Combining Equation (1) and Equation (8) gives 

Bz = Cx. (9) 

It is important to note that for any pair of different states Si and s 2 at position k where 1 < Si < S k and 
1 < s 2 < Sk, the value of bk sit ks 2 defines the ancestral relationship between the aberration events A ksi 
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and Ak s . 2 . Using the ancestor indicator variable a we can express this relationship as ak Sl ,ks 2 — &fesi,fes 2 - 
To preserve these ancestral relationships in both sides of Equation (9) (we recall that C = I + A) , the 
element Ck Sl ,ks 2 01 C must be equal to the element bk Sl ,ks 2 01 B for every pair of states Si and S2 at 
position k, where 1 < k < K , 1 < s\ < Sk and 1 < s\ < Sk- 
in order to find the solutions to Equation (8), we solve independently each subproblem yk = BkZk and 
we require that the Bk matrix encodes a tree which satisfies the evolutionary and parsimony constraints 
and whose root is the dummy unmutatcd aberration at position k. The number of solutions for each 
subproblem is equal to the number of trees with Sk labeled vertices and, as discussed earlier, this number 
is equal to S^ k ~ 2 . We then denote by B^ k) the t k -th solution (1 < t k < Sf fc ~ 2 ) and by y£ k) the 
aggregate frequency vector associated to it. The solutions to the problem for 5 fc < 3 are: 



S k 
Sk 



1. There is only one solution. Since the input consists only of the unmutated state, it follows that 
z k = [1]. The solution is b£^ = [1] and the corresponding aggregate frequency vector is = [1]. 

2. There is only one solution. Assuming the input is z k = [zk\, ^2] and the unmutated state is fcl, 

and the corresponding aggregate frequency vector is = [1, z k2 ]. 



the solution is B 



(i) 



1 1 
1 



S k = 3. There are 3 solutions. Assuming the input is z k = [z k i, 2fc2> z k 3 ] 



the solutions are B[^ = 
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and B[ 3) = 



and unmutated state is kl, 
1 1 1 

10 and their 
1 1 



corresponding y k vectors are given by yj^ = [1,^2,^3], y k 2) = [Mfc2 + 2fc3, ^3] and y£ 3) = 
[1, Zk2, Zk2 + Zks\- The values bk2,k3 and bk3,k2 define the ancestral relationship between aberrations 
Ak2 and Ak3- In the first solution Ak2 and Ak3 must be on separate branches, in the second solution 
Ak2 must be an ancestor of Ak3, and in the third solution Ak3 must be an ancestor of Ak2- 

The number of solutions grows dramatically with Sk (16 solutions for Sk — 4, 125 solutions for Sk — 5, 
1296 solutions for Sk = 6). Therefore, herein we explicitly show the solutions for Sk < 4 and implement 
the method to adress practical scenarios, such as nucleotide point mutations (Sk < 4). 

The set of all possible solutions to Equation (8) is given by the Cartesian product of the solution 
sets of each subproblem. This implies that the number of solutions of Equation (8) is IlfcLi ^fc^ 2 - We 
denote by B^ 1 '"-'* 14 ) the block matrix solution associated with the blocks B* 1 , . . . ,B^ representing the 
solutions of each subproblem. Similarly, we denote by y(*i>— the aggregate signal vector given by 
y(ti,...,t K ) _ g(ti,...,t K ) z _ ft j s possible to reduce the number of rows and columns in Equation (9) by 
substituting all the K rows corresponding to the unmutated states (shown in green in Figure 6) with 
a single dummy wildtype aberration (shown in blue in Figure 6). In our toy example, one aberrated 
state is observed at positions 1 and 3 (Si = S3 = 2), while two aberrated states (C and G) are observed 
at position 2 (S2 = 3). Therefore, there are three solutions to Equation (8): B^ 1 ' 1 ' 1 ), B^ 1 ' 2 ' 1 ) and 
B(i,3,i) _ pjg ure 7 illustrates the best solution for each of these binarization matrices in a graphical and 
matrix form. The solution associated with B^ 1 ' 1 ' 1 ) is the only TrAp-solution of the generalized subclonal 
deconvolution problem since it has the minimum number of populated subclones (sparsity constraint). 

In summary the generalized subclonal deconvolution problem can be solved as follows: 

1. Vectorize the aggregate frequency matrix Z and identify all binarization matrices B^ 1 ' - '* 1 ^ (Equa- 
tion (8)) compatible with the vector z. 

2. For each binarization matrix B^ 1 '-"'* 1 ^ , identify all first generation trees from the aggregate signal 
vector yltivitK) = g(ti,...,t K ) z anc [ combine the first generation trees to generate all partial trees 
compatible with B^* 1 '-'* 11 ). 



3. Discard partial trees that do not have the minimum number of populated subclones. 
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Figure 7. Three solutions of the generalized subclonal deconvolution problem for a 
mixture of three sequences in presence of poly-allelic mutations. We analyzed an aggregate 
sample composed of three subclones with sequences " TCT" , " TAC" and " CGT" mixed with frequencies 
0.1, 0.3 and 0.6, respectively. In this example, the wildtype sequence "CAT" is absent in the mixture. 
Nonzero elements of the aggregate frequency matrix Z are concatenated in the z vector, which consists 
of the elements z\c = 0.6, z\t = 0.4, Z2A = 0.3, Z2c = 0.1, Z2G — 0.6, 2:3c = 0.3 and 23T = 0.7. There 
are three binarization matrices (B' 1 ' 1 ' 1 ), B^ 1,2 ' 1 ' and B^ 1 ' 3 ' 1 )) to Equation (8) and one solution for 
each binarization matrix is shown. Mutations are shown using the notation 

" position : reference — > mutated" , e.g. the notation 2:A^G indicates that nucleotide at position 2 was 
mutated from Adenine to Guanine. The notation 2 : A — > G — > C indicates that nucleotide at position 2 
was first mutated from Adenine to Guanine and then from Guanine to Cytosine. Top: solution based 
on the binarization matrix B' 1 ' 1 ' 1 ), in which the subclones C3 and C4 associated with the aberration 
events A2C and A2G are on separate branches; Middle: solution based on the binarization matrix 
Bl 1 . 2 . 1 )^ j n which the aberration event A2G (subclone C4) happens before of A2C (subclone C3; bottom: 
solution based on the binarization matrix B^ 1,3,1 ), in which the aberration event A2C (subclone C3) 
happens before A2G (clone C4). The solutions are shown both as evolutionary trees (left) and in matrix 
form according to Equation (9). 
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4. Generate all evolutionary trees consistent with the partial trees comprising a maximum number of 
first generation trees. This step is performed as described above, but with the additional constraint 
that Cksi,ks2 = bksi,ks2 for any pair of states si and S2 at position k, where 1 < k < K , 1 < s\ < Sk 
and 1 < s 2 < Sk- 

5. Optimize the shallowness constraint by sorting the generated solutions by the depth of the generated 
tree. 



Benchmarks using simulated data 

The models presented above show that TrAp is an efficient algorithm for inferring subclonal components 
from the aggregate measure. In particular we have shown that in the absence of noise TrAp returns the 
exact solution when the underlying subclonal population satisfies reasonable constraints and that the 
algorithm is always able to find at least one solution. However, experimental measurements are often 
noisy and can only have finite precision. In this section, we first discuss two approaches to treat noisy 
input and then we apply our algorithm to simulated aggregate signals from random in silico trees showing 
that TrAp is robust to typical noise levels found in genomic experiments. 



Handling Measurement Errors 

In order to accommodate different types of noise that may arise in genomic data, we incorporated two 
error models in the TrAp algorithm. In both error models, the input to the TrAp algorithm requires an 
additional vector e of size N whose elements are related to the precision of the measure yi. The error 
related to the dummy variable is denoted by E\ and is set to as yi = 1 is a constraint of the model and 
thus £i must vanish. 

First, we examine the bound error model in which we assign a threshold to the error £j > of every 
underlying aggregate signal j/i such that each measured signal yi will be in the range [y~i — £i,y~i + £i]. 
Equation (6) is then modified accordingly and we can state that the subclone Ci defined by aberration i 
is not populated if and only if: 



N 



N 



(10) 



Next, we implement a normal error model assuming normally distributed measurements errors 
using a confidence interval approach to determine whether a subclone is populated or not. Specifically, 
we assume that the underlying aggregate signal is normally distributed around the observed signal, 
i.e. yi ~ A/"(j/i,e|). We substitute each term of the left-hand side of Equation (6) by its normal 



distribution in order to derive the distribution r ~ N (/j r ,af) with mean \i T = yi 



j-i/j and 



variance of = ef + ( t ) ij £ 'j- Using the distribution of r and a desired confidence level a > (default 

0.05), we can define that clone Ci is not populated if falls within the confidence interval [qg. , g i-c ], 
where q a is the a quantile of the distribution of r. 

Once the error model is chosen, the algorithm generates all optimal TrAp-trees in a similar fashion 
to the noise-free case. The main difference is that in the first step of the TrAp algorithm, Equation (10) 
(or a confidence interval on the distribution r ~ J\f (/i r , of)) is used instead of Equation (6) to identify 
first generation trees. Moreover, instead of using back-substitution for finding the vector x, we solve 
the nonnegative linear least square problem Cx = y with the constraint Xk = for all non-populated 
subclones k associated with the parents of the first generation trees. This fitting allows us to obtain a 
value of exactly zero for all non-populated subclones and to distribute measurement error more evenly 
in the vector x. 
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Deconvolution of simulated noisy aggregates 

To confirm that TrAp can correctly infer the subclonal composition from aggregate noisy signals with 
typical noise levels found in genomic experiments, we performed simulations starting with random in 
silico evolutionary trees (see Methods) with different numbers of aberrations N and different numbers of 
populated subclones P. For each tree, we also studied the effect of different magnitudes of measurement 
errors e and we investigated the operating conditions for which TrAp would assign the best score to the 
true solution. 

For each aggregate signal from a random tree we examined: (i) whether the true solution (i.e. the 
solution associated with the simulated tree) , which is by construction one of the possible solutions to the 
subclonal deconvolution problem, had the minimum number of populated subclones among all solutions 
(sparsity constraint), (ii) whether the true solution had the minimum number of populated subclones and 
minimum number of levels of the evolutionary tree among all solutions generated by TrAp (sparsity and 
shallowness constraints), and (iii) whether the true solution was the only TrAp solution (sparsity and 
shallowness constraints and uniqueness of the optimal solution). 

The results of the simulations show that aggregate signals from sparse trees are deconvolved correctly 
even in presence of typical noise levels of sequencing experiments (Figure 8). We note that for simulations 
of non-sparse trees, TrAp generates a large number of possible solutions of which only one is the true 
solution. Furthermore, in the presence of high levels of noise, the TrAp algorithm identifies a large 
number of first generation trees that satisfy Equation (10) and generates solution trees whose number of 
populated subclones is smaller than the number of populated subclones of the true solution. 




Figure 8. Deconvolution of simulated data. We generated 1000 simulations (see Methods) for 
each combination of number of populated subclones (columns), number of mutations (rows). We 
performed this analysis using different level of noise (error) drawn from a uniform distribution IA {— e, e). 
The heatmaps (tables) show the percentage of trees in each cell in which the TrAp solution has the 
minimum number of subclones (left panel), is also the shallowest solution with the best score (middle 
panel) and in addition to these two conditions is also unique (right panel). If the best solution is unique. 
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Analysis of simulated mixtures of biological data 

Deconvolution of a mathematical mixtures of karyotyping data from single tumor biopsies 

After showing that our approach can correctly deconvolve aggregate signals of subclones with a tree-like 
genealogy, we sought to investigate whether actual subclonal populations can be charted on evolutionary 
trees. For this purpose we analyzed the Mitclman database, consisting of cytogenetic analyses of more 
than 60,000 biopsies (see Methods). For each tumor type we counted how frequently the relationships 
between cancer clones from the same biopsy could be explained by an evolutionary tree that follows the 
evolutionarity and parsimony constraints (but not necessarily the sparsity and shallowness constraints). 
We found that almost all biopsies in the Mitelman database can be represented by evolutionary trees 
(Table 1), with the notable exception of astrocytoma of grades III and IV (Figure 9). 

Table 1. Applicability of the TrAP algorithm for different number of aberrations events 
and underlying subclones. 
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10 


100% (174) 


100% (14) 


100% (2) 


n/a 


50% (2) 


11 


100% (196) 


100% (25) 


67% (3) 


67% (3) 


n/a 


12 


100% (156) 


100% (16) 


100% (3) 


0% (1) 


50% (2) 


13 


100% (137) 


100% (21) 


50% (2) 


n/a 


100% (1) 


14 


100% (94) 


100% (12) 


n/a 


100% (1) 


100% (1) 


> 14 


100% (152) 


100% (22) 


57% (7) 


25% (4) 


25% (4) 



We mathematically mix all karyotypes of each single patient from the Mitelman database and apply the 
TrAp algorithm for each of these mixtures. The ability of the TrAp algorithm to extract the correct 
underlying clonal or subclonal components depends on the number of actual components (columns) and 
the multiplicity of aberrations studied in each mixture (rows). The frequency in which TrAp is able to 
recover the correct underlying components is shown in percentages. The number of mixtures for a given 
size of aberration multiplex (row) and given number of actual underlying components (column) is 
shown in parentheses. Note that when the column index is greater than the row index (entries shown in 
bold) the parsimony constraint cannot be satisfied. 



Next, we investigated whether the TrAp algorithm could uniquely deconvolve mixtures of the cancer 
subclones within a biopsy. As these aggregate signals are obtained by mixing actual subclonal profiles, 
we consider these signals to be more realistic than our previous in silico simulations. For each biopsy, 
we generated in silico mixtures by combining the cytogenetic profiles of each subclone using random 
nonnegative coefficients (sec Methods). We then applied our TrAp method to deconvolve in silico mixtures 
of these cancer clones and found that TrAp could correctly deconvolve 99.7% of these realistic aggregate 
signals. However, we note that this task was trivial for a significant fraction of the biopsies whose number 
of aberrations and/or subclones is small. Moreover, the TrAp algorithm inferred also intermediate nodes 
in the evolutionary tree that did not correspond to any of the cytogenetic profiles for the biopsy, providing 
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Acute lymphoblastic leukemia/lymphoblastic lymphoma (4371) 
Chronic myeloid leukemia, t(9;22) (2216) 
Acute myeloid leukemia, NOS (2157) 
Acute myeloblasts leukemia with maturation (FAB type M2) (1921) 
Adenocarcinoma (1700) 
Acute promyelocytic leukemia (FAB type M3) (1 206) 
Chronic myeloid leukemia, aberrant translocation (963) 
Acute myelo monocytic leukemia (FAB type M4) (938) 
Follicular lymphoma (930) 
Multiple myeloma (841) 
Chronic lymphocytic leukemia (798) 
Diffuse large B-cell lymphoma (749) 
Meningioma (632) 

Acute myeloblasts leukemia without maturation (FAB type M1) (603) 
Burkitt lymphoma/leukemia (51 1 ) 
Acute monoblastic leukemia (FAB type M5) (463) 
Mantle cell lymphoma (431) 
Myelodysplastic syndrome, NOS (423) 
Adenoma (417) 
Mature B-cell neoplasm, NOS (376) 
Refractory anemia with excess of blasts (FAB) (372) 
Ewing tumor/peripheral primitive neuroectodermal tumor (366) 
Astrocytoma, grade III — IV (298) 
Lipoma (290) 

Bilineage or biphenotypic leukemia (283) 
Refractory anemia (277) 
Acute erythroleukemia (FAB type M6) (273) 
Squamous cell carcinoma (272) 
Acute megakaryoblastic leukemia (FAB type M7) (264) 
Leiomyoma (236) 

Acute monoblastic leukemia without differentiation (FAB type M5a) (232) 
Acute myeloblasts leukemia with minimal differentiation (FAB type M0) (21 2) 
Malignant melanoma (205) 
Chronic myelo monocytic leukemia (190) 
Chronic myeloproliferative disorder, NOS (174) 
Wilms tumor (169) 
Synovial sarcoma (148) 
Teratoma (mature and immature) (143) 
Acute monoblastic leukemia with differentiation (FAB type M5b) (120) 
Adult T-cell lymphoma/leukemia (HTLV-1+) (118) 
Splenic marginal zone B-cell lymphoma (117) 
Neuroblastoma (114) 
Juvenile myelomonocytic leukemia (109) 
Chondroid hamartoma (104) 
Primitive neuroectodermal tumor/Medulloblasioma (89) 
Polycythemia vera (88) 
Idiopathic myelofibrosis (87) 
Plasma cell leukemia (85) 
Refractory anemia with ringed sideroblasts (83) 
Liposarcoma, myxoid/round cell (83) 
Transitional cell carcinoma (80) 
Refractory anemia with excess blasts-2 (79) 
Anaplastic large cell lymphoma, systemic type (75) 
Peripheral T-cell lymphoma, unspecified (75) 
Mycosis fungoides/Sezary syndrome (72) 
Alveolar rhabdomyosarcoma (71) 
Carcinoma, NOS (70) 
Nonneoplastic epithelial disorder/lesion (69) 
Extranodal marginal zone B-cell lymphoma (66) 
Mature T- and NK-cell neoplasm, NOS (65) 
Benign epithelial tumor, special type (64) 
T-prolymphocytic leukemia (64) 
Refractory anemia with excess blasts-1 (64) 
Leiomyosarcoma (61 ) 
Oncocytoma (61 ) 
Malignant peripheral nerve sheath tumor/Triton (60) 
Retinoblastoma (58) 
Ependymoma (54) 
Chondrosarcoma, NOS (53) 
Combined germ cell tumors (53) 
Nonneoplastic hematologic disorder/lesion (53) 
Schwannoma (52) 
Hodgkin disease, nodular sclerosis (52) 
B-prolymphocytic leukemia (52) 
Osteosarcoma, NOS (51 ) 




1 2 3 4 5 6 7 8 9 10 11 12 13 >13 



Number of aberrations 



Figure 9. Applicability of the TrAp algorithm for different number of aberrations events 
and different kind of tumors. Each entry in the heat map shows the number of biopsies, which 
contained a given number of aberrations. Each cell is colored by the fraction of biopsies for which TrAp 
could reconstruct the correct composition of the subclones, from red (all biopsies could be 
reconstructed) to blue (no biopsies could be reconstructed). The constraints required by the TrAp 
algorithm are satisfied in most cancer types, with the exception of astrocytoma of grades III and IV. 
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a plausible picture of the evolutionary order in which the aberrations occurred. Figure 10 shows the result 
of two deconvolution simulations, one from a melanoma sample with 2 subclonal populations [64] and 
one from an adenocarcinoma sample with 3 subclonal populations [65]. An interesting, yet rare (only 5 
examples, shown in bold in Table 1), case of biopsies showed more clones than aberrations. Albeit a tree- 
like genealogical relationship can be constructed, these biopsies do not satisfy the parsimony constraint 
since the number of subclones M is greater than the number of mutations N. For this reasons, their 
genealogy cannot be reconstructed by the TrAp algorithm or by any other method that makes use of a 
similar parsimony constraint [48,56,58,59]. 



Melanoma 



Adenocarcinoma 



C 4 (27.9%) 




C 2 (72.1%) 

t(7;7), t(1;7),t(1;8) 
-11,t(10;11) 



C 1 (0.0%) 




C 3 (0.0%) 



-8, t(3;5), -6, -9 



C 2 (17.0%) 

-8, t(3;5), -6, -9, -14, -1,-4 
-11 



-8, t(3;5), -6, -9, -Y 



Figure 10. Deconvolution of random mixtures of three subclones. The boxes represent 
different subclones, each denoted by the list of its aberrations. The aberration profiles of two subclones 
identified by cytogenetics in a melanoma biopsy (left) and the aberration profiles of three subclones 
identified in an adenocarcinoma biopsy (right) have been mixed in silico using random coefficients. In 
both case, the mixtures were successfully deconvolved. Aberrations are grouped within the boxes 
according to the order of occurrence. The reconstructed evolutionary trees suggest intermediate (white 
boxes), probably rare, subclones that were not reported in the cytogenetic data. 



Deconvolution of a mathematical mixture of somatic hypermutation data with poly-allelic 
mutations in a single nucleotide 

Somatic hypermutation (SHM) introduces mutations that target the variable regions associated with 
immune adaptivity, i.e. Ig loci. In particular, SHM involves a programmed process of mutations that 
affects the variable regions of immunoglobulin genes and starts from an initial dividing single cell (a naive 
B cell in this case). All descendants of the founder cell accumulate mutations and, at the same time, are 
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subjected to a strong selective pressure. For this reason, SHM is a particularly good biological model 
system to test our deconvolution method, which imposes tree-like evolutionary constraints. 

We considered a dataset where a total of 20 mutated nucleotides in the V(D)J region of the Ig locus 
were measured in 8 sequences extracted from the same germinal center (see Methods) [66]. This dataset 
was particularly interesting because of the high number of mutations found and because of the presence 
of poly-allelic mutations. 

We mathematically mixed the multi-subclonal data and applied our TrAp algorithm taking into 
account that the SHM scenario consists of non-binary aberrations. In all simulations, TrAp was able to 
recover the original sequences and the solution was unique in 65% of the simulations. The TrAp-solution 
of one simulation is shown in Figure 11. However, even if the solution was not always unique, in more 
than 97% of the simulations there were only five or less candidate solutions satisfying the evolutionary, 
parsimony and sparsity constraints, all of which correctly identified six out of seven subclones. 

In addition to the identification of the underlying subclones, the TrAp algorithm generates evolu- 
tionary trees, which represent the B cell lineage during the somatic hypermutation process. The recon- 
struction of B cell lineage can provides important insights into the mechanisms that regulate adaptive 
immunity. B cell lineage reconstruction is generally performed using maximum parsimony constraints [56] 
using the sequences of several Ig loci as input. However, in contrast to these approaches, the TrAp algo- 
rithm is able to generate maximum parsimony trees when only the relative frequency of mutations at each 
nucleotide is available. Therefore, the TrAp algorithm can be used to generate parsimonious evolutionary 
trees when only partial sequence information is available, e.g. when only short read sequences from a 
single aggregate sample are available, or when the loci analyzed span a region that is too large to be fully 
sequenced. 

Analysis of tumor biopsies 

Comparison between subclonal aberration profiles inferred from heterogeneous cell popu- 
lations and singl-cell aberration profiles 

We analyzed data from a recent study on renal cell carcinoma where two aggregate samples and twenty 
single cells were isolated from a clear cell renal cell carcinoma (ccRCC) and subjected to exome sequencing. 
Interestingly, the original study only showed partial similarity between the single cells and the aggregate 
[24]. However, since the single cells and the aggregate used in the experiments are from the same tumor, 
we sought to investigate whether any subclones inferred by TrAp would share a similar combination of 
mutations found in any of the single cells. 

We applied our TrAp algorithm to the aggregate sample and obtained an evolutionary tree consisting 
of three main subclones. Due to the lack of extensive validations, we limited ourselves to investigate 
whether mutations that co-occur in the TrAp-solution also co-occur in single cell samples. We considered 
the mutations that were validated by bioinformatics analysis (Table S3A from Xu et al. [24]) and by 
PCR validation (TableS3B). The fraction of correctly estimated co-occurrences was 0.76 for mutations 
validated by bioinformatics analysis and 0.74 for mutations validated by PCR. 

Melanoma data 

Finally, we applied our algorithm to investigate evolutionary mechanisms in tumor metastases using 
exome sequencing data from three tumor metastases (TM1: left lateral chest wall, TM4: pleural cavity, 
and TM3: right axilla) a matched normal sample (N: left lateral chest wall) of one melanoma patient [67]. 
TrAp can efficiently handle aggregate signal vectors of about 20 unique frequencies and therefore we 
perform deconvolution analysis only on one chromosome. We selected chromosome 18 as it contains 
the tumor suppressor Deleted in Colorectal Cancer {DCC) gene, which is known to exhibit a high load 
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Intermediate clone (0.0%) 

170:A-»G, 349:A-»T 



Sequence 6 (8.5%) 
34:A-»T, 170:A->G, 297:A- 
347:G--A, 349:A--T 



Intermediate clone (0.0%) 
158:G->C, 170:A-»G 
349:A--T 



Sequence 7 (10.9%) 
158:G-C, 170:A->G 
339:A-T, 349:A-T 



Intermediate clone (0.0%) 

158:G->C, 170:A->G--C 
349:A->T 



Sequence 2 (22.6' 
29:A-C, 158:G-»C 

170:A-»G->C, 213:A^C 
228:C->T, 229:C^A 
233:C-»A, 237:A->G 
349:A-»T 




Sequence 3 (20.2%) 

45:G-A, 158:G^C 
170:A^G->C, 349:A-»T 



Intermediate clone (0.0%) 

145:G->C, 158:G->C 
170:A-G-C, 349:A->T 



Sequences 5 and 8 (21 .2%) 

29:A-»C, 158:G-«C 
170:A-»G->C, 211:A->G 
213:A-C, 228:C->T 
229:C->A, 233:C->A 
237:A-»G, 349:A->T 




Sequence 4 (7.9%) 
145:G->C, 158:G->C 
170:A->G->C, 180:A->G 
349:A-»T 



Sequence 1 (8.8%) 

105:T->G, 145:G->C 
157:A->G, 158:G-C 
170:A->G-»C, 202:A->T 
349:A->T 



Figure 11. Deconvolution of a random mixture of eight sequences from somatic 
hypermutation data. Eight sequences from the Ig locus of eight cells extracted from the same 
germinal center were mixed with the random coefficients given by 

x = [8.8%, 22.6%, 20.2%, 7.9%, 5.7%, 8.5%, 10.9%, 15.5%]. Since sequences 5 and 8 are identical, they are 
grouped in a single clone whose relative frequency is 5.7% + 15.5% = 21.2%. In total, twenty mutated 
nucleotides were found in the data and two different mutations were identified at position 170. 
Mutations are shown using the notation "position: reference mutated", e.g. the notation 170: A— >G 
indicates that the nucleotide at position 170 was mutated from Adenine to Guanine. The notation 
170: A— >G— >C indicates that the nucleotide at position 170 was mutated twice, first from Adenine to 
Guanine and then from Guanine to Cytosine. In this example, all seven subclones were correctly 
deconvolved by the TrAp algorithm, the frequency of the subclones was correctly estimated and the 
solution was unique. 



of mutations only in melanoma [68] , in contrast to low expression, loss of heterozygosity or epigenetic 
silencing in other tumors. 

To apply the TrAp algorithm, we first preprocessed the data and selected 19 mutations on chromosome 
18 (see Methods). We labeled each mutation according to the gene affected and the amino acid change 
caused by the mutation (e.g. the label DCC.L1099H indicates a mutation in the DCC gene that causes a 
mutation from a Leucine to a Histidine at position 1099 in the DCC protein). There are six mutations with 
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> 99% frequency in all samples (including the normal): ADNP2.G54G, ALPK2.I2157V, CD226.S307G, 
DCC.F23L, NET01.S481N, and SLC39A6.E119D. The only other mutation found in the normal sample 
was TCEB3CL.S301C, which occurs with frequency > 90% in all samples. Moreover, the mutations 
ALPK2.R136S, CHST9.S122N, FAM38B.V2463, LAMA1.S1577A, LAMA1.K2002E, MYOM1.T215M, 
SERPINB10.R246C and SLC14A2.A880T were found in all three tumor samples and shared a similar 
frequency profile. The mutations DSC3.A28D, DSG1.M11V and IMPACT. D125E were found only in 
the metastases samples TM3 and TM4 and shared a similar frequency profile. Finally, the mutation 
DCC.L1099H was found only in the sample TM3. 

Independent runs on the three metastatic samples gave 33 optimal solutions for TM1, 222 for TM3 
and only 1 TrAp-solution for TM4. These high number of solutions is due to the substantial noise of 
the experiment (in the range 0.005 — 0.025) and the fact that in samples TM1 and TM3, two of the 
subclones have similar frequencies and are thus difficult to separate from one another. However, TrAp 
identified an unique solution in the sample TM4, where the three populated subclones are distributed 
with significantly different frequencies (Figure 12 middle). Next, we reasoned that the metastatic TM1, 
TM3 and TM4 samples may share common ancestors and that their evolutionary profiles may be related. 
We then applied our TrAp algorithm while also imposing that all evolutionary trees must be a subset of 
one global evolutionary tree. This approach returned an unique solution for each sample, all of which 
were among the solution sets identified in the previous analyses. We observe that this approach can be 
very powerful since, in principle, it allows the reconstruction of very large trees by combining several 
snapshots of the related subclonal populations. 

The results of the deconvolution are shown in Figure 12. We observe the presence of two main 
branches. Mutations in the left branch of TM4 (77%) are more abundant than in TM1 and TM3 
(48%). We note that LAMA1, a protein that is involved in cell adhesion, is present in the right branch. 
44% (19%) of the subclones of TM3 (TM4) have mutations in DSC3, DSG1 and IMPACT. The TM3 
subclone also acquires a second mutation in the DCC gene (DCC.L1099H-L) in addition to the mutation 
DCC.F23L, which was hereditary. The novel mutation in the DCC gene occurs close to the boundary 
between the extracellular domain and the transmembrane domain of the protein product. The resulting 
Histidine amino-acid is positively charged, opposed to the Leucine amino-acid of wildtype, which is 
neutrally charged. Because this change is next to the cell membrane, it may have repercussions on the 
functionality of the DCC protein product, perhaps causing inactivation, similar to the inactivation caused 
by loss of heterozygosity and transcript suppression observed in other cancer types. 

Discussion 

In the present study we presented the TrAp algorithm, a tool for inferring subclonal composition and 
abundance from a single aggregate measurement experiment. As we have shown, TrAp is robust to noise 
and it can deconvolve mixtures where multiple mutations occurs at the same locus. TrAp efficiently 
enumerates all possible solutions that are compatible with the model constraints, thus always identifying 
the sparsest and most parsimonious solution(s). However, TrAp will also generate trees (cf. Equation 2) 
in cases where no tree structure can be inferred. As we have shown, such structures, while deviating from 
the true underlying population structure, can still capture relevant co-occurrences of mutations that are 
specific to certain subclones. Further, in contrast to parsimonious neighbor-joining approaches, which 
rely on sampling single subclones from the population (e.g. single cell experiments), TrAp utilizes one 
single aggregate experiment as input, thus overcoming the issue of small sampling size, which may be 
insufficient to cover the whole spectrum of subclones in a sample. We successfully deconvolved systems 
of up to 25 aberrations. Although this number may not always be large enough to consider all somatic 
mutations found in a tumor sample, this problem can be circumvented by clustering aberrations with 
similar frequencies before running the TrAp algorithm. 

The level of characterization achieved by subclonal deconvolution holds high potential for person- 
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Tumor Metastasis TM1 

left lateral chest wall 



Tumor Metastasis TM4 

pleural cavity 



Tumor Metastasis TM3 
right axilla 



(6%) 

ADNP2.G54G, ALPK2.I21 57V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E119D 



C, (4%) 

ADNP2.G54G , ALPK2. 12157V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E1 19D 



(8%) 

ADNP2.G54G , ALPK2. 12157V 
CD226.S307G, DCC.F23L 
NET01.S481N, SLC39A6.E119D 



2 < 46% ) 
53CLS301C 



C 3 (48%) 

ALPK2.R136S, CHST9.S122N 
FAM38B.V2463I , LAMA1.S1577A 
LAMA1.K2002E, MYOM1.T215M 
SERPINB10.R246C, SLC14A2.A880T 




C, (77%) 



ALPK2 R136S- CHST9.S122N 
FAM38B.V2463I . LAMA1.S1577A 
LAMA1.K2002E, MYOM1.T215M 
SERPINB10.R246C, SLC14A2.A880T 



C 4 (19%) 



DSC3.A28D, DSG1.M11V 
IMPACT.D125E 



C 2 (0%) 

TCEB3CL.S301 C 



C 3 (48%) 

ALPK2.R136S, CHST9.S1 22N 
FAM38B.V2463I , LAMA1.S1577A 
LAMA1.K20O2E, MYOM1.T215M 
SERPINB10.R246C, SLC14A2.A880T 



C 4 (0%) 

DSC3.A28D, DSG1.M11V 
IMPACT.D125E 




Figure 12. Evolutionary trees inferred from three metastases of a melanoma patient. Each 
subclone in these trees is represented by a box with a list of mutations that includes only its new 
mutations (ancestral mutations can be read off by tracing back the mutation lists of all of its ancestors). 
Mutations are labeled according to the gene affected and the amino acid change caused by the mutation 
(e.g. the label DCC.L1099H indicates a mutation in the DCC gene that causes a mutation from a 
Leucine to a Histidine at position 1099 in the DCC protein). Highly expressed genes from this patient 
are indicated in bold. Mutations in the left branch of TM4 are more abundant than in TM1 and TM3. 
44% (19%) of the subclones of TM3 (TM4) have mutations in DSC3, DSG1 and IMPACT. The TM3 
subclone has an additional mutation in DCC. 



alizcd therapies. Possible applications include the classification of subclones in primary tumors, the 
identification of the seeds of metastases, tracing of resistant subclones especially after drug treatments 
and developing treatment strategies to eliminate resistant subclones. Furthermore, our proposed model 
can be applied to other medical problems, such as tracing bacterial or viral paths of adaptation within 
the infected host, detailed genome- wide reconstruction of the epigenetic differentiation program, or class 
specification in the hematopoietic system or in other systems. 



Methods 

In this section, we describe the procedures used to preprocess input data for the TrAp algorithm. 



Data processing and generation 
Random in silico evolutionary trees 

We performed simulations by sampling genotypes whose size N ranged from 1 to 12 and with underlying 
number of populated subclones P ranging from 1 to N—l. The simulations were repeated for measurement 
error values e equal to 10~ 2 , 10~ 3 and 10 -4 . For each combination of these quantities, we performed 
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1000 runs using in silico generated data as follows: during each run, a random evolutionary tree with 
N aberrations was generated. The set of P populated subclones was then selected by first including 
all leaves of the generated tree and then adding the remaining subclones randomly. The frequency of 
the populated subclones was randomly assigned and the frequency of the non-populated clones was set 
to zero. Next, the aggregate frequency vector y was calculated from the generated tree. Finally, we 
perturbed each element of y, by adding an error £j drawn from a uniform distribution U (— e,e). The 
elements y, = y~i + Si and the error Si are used as input for the bound error model option of the TrAp 
algorithm. 

Cytogenetic data 

The cytogenetic data was obtained from the Mitelman database, which contains 61,846 biopsies as of 
August 15, 2012. We accessed the database through the CGAP Website [69] and we filtered out 29,842 
biopsies with uncertain calls (indicated by a "?" in the karyotype data). We focused our search only 
on aberrations that are binary by nature, namely chromosome deletions and translocations. For each 
biopsy, we performed 100 in silico simulations in which we mixed the subclones using random nonnegative 
coefficients. 

Somatic hypermutation data 

SHM sequencing data was derived from B cells undergoing somatic hypermutation (SHM), a process that 
leads to high-affinity antibody molecules [70]. In details, we analyzed sequences of the V(D)J region 
extracted from the same germinal center from the sample 11930dl6-4-print.2, which was sequenced by 
Anderson et al. [66,71]. The sequences were aligned using the IMGT alignment tool [72,73] in order 
to properly align the V, D and J regions. We selected 8 sequences that were aligned to the same 
V(D)J sequence (Vi(-Di)Ji). Since these sequences are from the same germinal center and align to the 
same V(D)J sequence, they are expected to stem from a single naive B cell and have evolved through 
the somatic hypermutation process. From the sequencing experiment, 20 mutated nucleotides were 
identified. Furthermore, a polyallelic mutations were found at position 170, where both A — > C and 
A — > G mutations were observed. Next, we considered the 7 unique sequences (sequences 5 and 8 were 
identical) as representatives of the genome of 7 different subclones. Similarly to the previous application, 
we mixed these subclones using random nonnegative coefficients and performed 1000 simulations using 
random nonnegative coefficients. 

Exome capture sequencing data 

Exome-capture data [74] was obtained from a recent clear cell renal cell carcinoma (ccRCC) study [24] 
and from the melanoma patient YUHUY of the Yale cohort, for which DNA from normal circulating 
lymphocytes and three metastases (TM1, TM3 and TM4) were subjected to exome-capture Illumina 
Hi-Scq sequencing [67]. 

Exome-Seq reads from the aggregate samples of the ccRCC patient were combined and aligned to 
the human reference genome (assembly hgl9) using Bowtie [75] with parameters "-n3 -kl -m20 -132 — 
chunkmbs 1024 —best —strata" . The frequency and coverage of each point mutation was computed using 
VarScan [76]. We further selected the mutations that were validated by Xu et al. [24] by PCR validation 
(TableS3B) and by bioinformatics analysis (Table S3A ), whose genomic coordinated were realigned to 
the assembly hgl9 using the Lift-Over tool of Galaxy [77]. 

For the melanoma patient YUHUY [67], we selected 19 mutations that were homozygous in the normal 
sample (i.e. y w or y ss 1), had high sequence coverage (i.e. more than 200 reads covering the specific 
nucleotide) and were localized on chromosome 18. 
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Since none of the genomic positions analyzed contained poly-allelic mutations, we assign a binary 
state (normal/mutated) to each selected genomic position and we estimated the aggregate signal and 
measurement error for each mutation event using a normal approximation as 



where m is total the number of reads covering position i and is the number of reads covering position i 
where the nucleotide i is mutated. Finally, the y and e vectors were used as input for the TrAp algorithm, 
which was set to use the normal error model. 

Implementation of the TrAp algorithm 

TrAp was programmed in Java. TrAp makes use of the Java Matrix package [78] for linear regression and 
code by Josh Vermaas to solve the nonnegative least square problem using JAMA. The Java Universal Net- 
work/Graph Framework (JUNG) [79] is used for creating pictorial representations of evolutionary trees. 
TrAp is released under the GNU Lesser General Public License 3.0 and can be downloaded on the Source- 
Forge repository of Kluger's lab at the URL http://sourceforge.net/projects/klugerlab/files/TrAp/. 
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